library(tidyverse)Into the Tidyverse
Common tasks in working with data include actions like removing rows or columns, performing calculations, or adding new columns. This sort of operations is known as data manipulation. It is the process of cleaning, organizing, and transforming raw data into a more structured and usable format for analysis.
In this workshop, we’ll guide you through the process of data manipulation in R, starting with the tidyverse. The tidyverse is a collection of packages that align with a data science philosophy developed by Hadley Wickham and the RStudio team. Many users find it to be a more intuitive way to grasp R concepts. Although some tasks may be more straightforward in base-R, we’ll also highlight those aspects.
If you’ve already installed the tidyverse package (if not, you can do so by running the command: install.packages("tidyverse")), let’s proceed to load it into our R session:
The following are key techniques outlined in Hadley Wickham and Garrett Grolemund’s book, R for Data Science:
This workshop focuses on:
- Import:
readrto import data. - Tidy:
tidyrto organize rows of data into unique values. - Transform:
dplyrto perform data manipulation, involving tasks such as subsetting by rows or columns, sorting, and joining. - Visualize:
ggplot2to create static plots, applying the principles of the grammar of graphics.
This workflow is of utmost importance. Instead of constructing analyses based on the unconventional format of your data, take measures to tidy up your data. Tidy data enables the use of number of analytical and visualization tools. It eliminates the need to develop ad-hoc methods to accommodate your data. This not only saves time but also enhances the clarity and comprehensibility of your work, benefiting both your collaborators and, most importantly, your future self.
Step 1: Importing data
First we need to import data into our R session. This can be achieved either by using datasets bundled with R packages or by importing external data into our workspace for data manipulation. This is the first step in the tidyverse workflow.
Preloaded Data in R Packages
A great way to learn data science tools is using the data provided by R packages. The data() function is a convenient way to explore and import pre-loaded datasets that come bundled with the R environment.
library(tidyverse)
data()View Output
To import pre-loaded datasets bundled with a particular package (e.g., tidyr):
data(package = "tidyr")View Output
When you load a package the pre-loaded datasets automatically imported into you R environment. Therefore you can access them directly as follows.
householdOutput
# A tibble: 5 × 5
family dob_child1 dob_child2 name_child1 name_child2
<int> <date> <date> <chr> <chr>
1 1 1998-11-26 2000-01-29 Susan Jose
2 2 1996-06-22 NA Mark <NA>
3 3 2002-07-11 2004-04-05 Sam Seth
4 4 2004-10-10 2009-08-27 Craig Khai
5 5 2000-12-05 2005-02-28 Parker Gracie
?householdView Output
| household | R Documentation |
Household data
Description
This dataset is based on an example in vignette(“datatable-reshape”, package = “data.table”)
Usage
household
Format
A data frame with 5 rows and 5 columns:
- family
-
Family identifier
- dob_child1
-
Date of birth of first child
- dob_child2
-
Date of birth of second child
- name_child1
-
Name of first child
?
- name_child2
-
Name of second child
Reading the Data
At some point, you want to apply what you’ve learned to your own data. In this section, you’ll learn the basics of reading data files into R using the readr package. The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data.
We will use the read_csv() function from readr package to import a dataset. (See also read.csv() in base R.) CSV short for Comma Separated Values, is a text format commonly used to store tabular data. Conventionally the first line contains column headings.
The first argument of the read_csv() function takes the path to the file (or a web link). The following code will work if the cms_hospital_patient_satisfaction_2016_sampled.csv file is in the data/patient_satisfaction path.
# here the first argument is a path
cms_data <- read_csv("data/patient_satisfaction/cms_hospital_patient_satisfaction_2016_sampled.csv")Output
Rows: 15 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ID, Facility Name, County, Hospital Type
dbl (4): Star Rating, No of Surveys, Response Rate, Overall Rating
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# here the first argument is a web link
mtvcars <- read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv")Output
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This command prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).
Other file types
Once you’ve mastered read_csv(), using other functions in the readr package is straightforward. It’s just a matter of knowing which function to use:
read_csv2()reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker or thousands seperator.read_tsv()reads tab-delimited files.read_delim()reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.
Exploring the Data
In the previous section we imported a dataset that is bundled with the tidyr package, into a dataframe named household. In the next section we read in a CSV file and created a data frame named cms_data. This section demonstrates different ways to get to know these two data objects.
The class() function is used to identify the data type or data structure of an object (or variable):
class(household)
class(cms_data)Output
[1] "tbl_df" "tbl" "data.frame"
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
When the name of the object (data frame) is typed, the first few lines along with some information, such as the number of rows are displayed:
cms_dataOutput
# A tibble: 15 × 8
ID `Facility Name` County `Hospital Type` `Star Rating` `No of Surveys`
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 050424 SCRIPPS GREEN HO… SAN D… Acute Care Hos… 4 3110
2 140103 ST BERNARD HOSPI… COOK Acute Care Hos… 1 264
3 100051 SOUTH LAKE HOSPI… LAKE Acute Care Hos… 2 1382
4 040062 MERCY HOSPITAL F… SEBAS… Acute Care Hos… 3 2506
5 440048 BAPTIST MEMORIAL… SHELBY Acute Care Hos… 2 1799
6 450011 ST JOSEPH REGION… BRAZOS Acute Care Hos… 3 1379
7 151317 GREENE COUNTY GE… GREENE Critical Acces… 3 114
8 061327 SOUTHWEST MEMORI… MONTE… Critical Acces… 4 247
9 490057 SENTARA GENERAL … VIRGI… Acute Care Hos… 4 619
10 110215 PIEDMONT FAYETTE… FAYET… Acute Care Hos… 2 1714
11 050704 MISSION COMMUNIT… LOS A… Acute Care Hos… 3 241
12 100296 DOCTORS HOSPITAL MIAMI… Acute Care Hos… 4 393
13 440003 SUMNER REGIONAL … SUMNER Acute Care Hos… 4 680
14 501339 WHIDBEY GENERAL … ISLAND Critical Acces… 3 389
15 050116 NORTHRIDGE MEDIC… LOS A… Acute Care Hos… 3 1110
# ℹ 2 more variables: `Response Rate` <dbl>, `Overall Rating` <dbl>
The dim() function prints the dimensions (rows x columns) of the data frame:
dim(cms_data)Output
[1] 15 8
This information is available at the environment pane in the top right panel as the number of observations (rows) and variables (columns).
The nrow() function prints the number of rows while ncol() prints the number of columns:
nrow(cms_data)
ncol(cms_data)Output
[1] 15
[1] 8
The View() function gives a spreadsheet-like view of the data frame:
View(cms_data)By clicking the object on the environment tab also gives a spreadsheet-like view of the object:
The glimpse()function (dplyr package) displays a compact summary of the data frame, showing you key details such as the data types of each column, the first few values, and the total number of observations.
glimpse(cms_data)Output
Rows: 15
Columns: 8
$ ID <chr> "050424", "140103", "100051", "040062", "440048", "45…
$ `Facility Name` <chr> "SCRIPPS GREEN HOSPITAL", "ST BERNARD HOSPITAL", "SOU…
$ County <chr> "SAN DIEGO", "COOK", "LAKE", "SEBASTIAN", "SHELBY", "…
$ `Hospital Type` <chr> "Acute Care Hospital", "Acute Care Hospital", "Acute …
$ `Star Rating` <dbl> 4, 1, 2, 3, 2, 3, 3, 4, 4, 2, 3, 4, 4, 3, 3
$ `No of Surveys` <dbl> 3110, 264, 1382, 2506, 1799, 1379, 114, 247, 619, 171…
$ `Response Rate` <dbl> 41, 6, 20, 35, 18, 24, 22, 34, 32, 21, 14, 24, 35, 29…
$ `Overall Rating` <dbl> 5, 2, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 3, 2
The head() function prints the top 6 rows of a data frame:
head(cms_data)Output
# A tibble: 6 × 8
ID `Facility Name` County `Hospital Type` `Star Rating` `No of Surveys`
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 050424 SCRIPPS GREEN HOS… SAN D… Acute Care Hos… 4 3110
2 140103 ST BERNARD HOSPIT… COOK Acute Care Hos… 1 264
3 100051 SOUTH LAKE HOSPIT… LAKE Acute Care Hos… 2 1382
4 040062 MERCY HOSPITAL FO… SEBAS… Acute Care Hos… 3 2506
5 440048 BAPTIST MEMORIAL … SHELBY Acute Care Hos… 2 1799
6 450011 ST JOSEPH REGIONA… BRAZOS Acute Care Hos… 3 1379
# ℹ 2 more variables: `Response Rate` <dbl>, `Overall Rating` <dbl>
Similarly, the tail() function prints the bottom 6 rows of the data frame:
tail(cms_data)Output
# A tibble: 6 × 8
ID `Facility Name` County `Hospital Type` `Star Rating` `No of Surveys`
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 110215 PIEDMONT FAYETTE … FAYET… Acute Care Hos… 2 1714
2 050704 MISSION COMMUNITY… LOS A… Acute Care Hos… 3 241
3 100296 DOCTORS HOSPITAL MIAMI… Acute Care Hos… 4 393
4 440003 SUMNER REGIONAL M… SUMNER Acute Care Hos… 4 680
5 501339 WHIDBEY GENERAL H… ISLAND Critical Acces… 3 389
6 050116 NORTHRIDGE MEDICA… LOS A… Acute Care Hos… 3 1110
# ℹ 2 more variables: `Response Rate` <dbl>, `Overall Rating` <dbl>
The colnames() function displays all the column names:
colnames(cms_data)[1] "ID" "Facility Name" "County" "Hospital Type"
[5] "Star Rating" "No of Surveys" "Response Rate" "Overall Rating"
The $ symbol allows access to individual columns. To display ‘Hospital Type’ column:
cms_data$Hospital TypeError: <text>:1:19: unexpected symbol
1: cms_data$Hospital Type
^
Since the column names contain spaces, they need to be enclosed within `` (backticks) for R to interpret it as a single variable:
cms_data$`Hospital Type` # or cms_data$"Hospital Type"It is a good practice to rename all the columns with spaces into a format that R can interpret. Conventionally, _ (underscore) is used to separate words in column names and variables instead of (space). We can rename a single column using the rename() function:
cms_data <- rename(cms_data, Hospital_Type = "Hospital Type")
head(cms_data)Output
# A tibble: 6 × 8
ID `Facility Name` County Hospital_Type `Star Rating` `No of Surveys`
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 050424 SCRIPPS GREEN HOSPI… SAN D… Acute Care H… 4 3110
2 140103 ST BERNARD HOSPITAL COOK Acute Care H… 1 264
3 100051 SOUTH LAKE HOSPITAL LAKE Acute Care H… 2 1382
4 040062 MERCY HOSPITAL FORT… SEBAS… Acute Care H… 3 2506
5 440048 BAPTIST MEMORIAL HO… SHELBY Acute Care H… 2 1799
6 450011 ST JOSEPH REGIONAL … BRAZOS Acute Care H… 3 1379
# ℹ 2 more variables: `Response Rate` <dbl>, `Overall Rating` <dbl>
Note: To modify the original cms_data object, it is essential to assign the renamed object to cms_data. Otherwise, a copy of the cms_data object is changed.
Or by directly assigning a new column name:
colnames(cms_data)[2] <- "Facility_Name"
head(cms_data)Output
# A tibble: 6 × 8
ID Facility_Name County Hospital_Type `Star Rating` `No of Surveys`
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 050424 SCRIPPS GREEN HOSPI… SAN D… Acute Care H… 4 3110
2 140103 ST BERNARD HOSPITAL COOK Acute Care H… 1 264
3 100051 SOUTH LAKE HOSPITAL LAKE Acute Care H… 2 1382
4 040062 MERCY HOSPITAL FORT… SEBAS… Acute Care H… 3 2506
5 440048 BAPTIST MEMORIAL HO… SHELBY Acute Care H… 2 1799
6 450011 ST JOSEPH REGIONAL … BRAZOS Acute Care H… 3 1379
# ℹ 2 more variables: `Response Rate` <dbl>, `Overall Rating` <dbl>
Using the above methods to rename multiple columns just to replace space with underscore can be laborious. A quick way to replace all spaces in the column names with underscore is shown below.
Option 1:
# by string substituting _ in place of ' '(space)
colnames(cms_data) <- gsub(" ", "_", colnames(cms_data))
colnames(cms_data)Output
[1] "ID" "Facility_Name" "County" "Hospital_Type"
[5] "Star_Rating" "No_of_Surveys" "Response_Rate" "Overall_Rating"
Here the gsub() function replace space with underscores in the column names of the cms_data. This new object needs to be assigned (<-) to colnames(cms_data) in order for the changes to be saved in the cms_data object.
Option 2:
Here we use the janitor package which contains functions to clean data. You first need to install and load the library before using it.
# using janitor package
library(janitor) # remember to install janitor: install.packages("janitor")
cms_data <- cms_data |> clean_names()
cms_dataOutput
# A tibble: 15 × 8
id facility_name county hospital_type star_rating no_of_surveys
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 050424 SCRIPPS GREEN HOSPITAL SAN D… Acute Care H… 4 3110
2 140103 ST BERNARD HOSPITAL COOK Acute Care H… 1 264
3 100051 SOUTH LAKE HOSPITAL LAKE Acute Care H… 2 1382
4 040062 MERCY HOSPITAL FORT SM… SEBAS… Acute Care H… 3 2506
5 440048 BAPTIST MEMORIAL HOSPI… SHELBY Acute Care H… 2 1799
6 450011 ST JOSEPH REGIONAL HEA… BRAZOS Acute Care H… 3 1379
7 151317 GREENE COUNTY GENERAL … GREENE Critical Acc… 3 114
8 061327 SOUTHWEST MEMORIAL HOS… MONTE… Critical Acc… 4 247
9 490057 SENTARA GENERAL HOSPIT… VIRGI… Acute Care H… 4 619
10 110215 PIEDMONT FAYETTE HOSPI… FAYET… Acute Care H… 2 1714
11 050704 MISSION COMMUNITY HOSP… LOS A… Acute Care H… 3 241
12 100296 DOCTORS HOSPITAL MIAMI… Acute Care H… 4 393
13 440003 SUMNER REGIONAL MEDICA… SUMNER Acute Care H… 4 680
14 501339 WHIDBEY GENERAL HOSPIT… ISLAND Critical Acc… 3 389
15 050116 NORTHRIDGE MEDICAL CEN… LOS A… Acute Care H… 3 1110
# ℹ 2 more variables: response_rate <dbl>, overall_rating <dbl>
Here we are sending the cms_data data frame into the function clean_names() which replaces replaces spaces with underscore in column names. Note that it also convert all characters in column names to lower case.
The default format is snake case styled as snake_case. You can specify other formats like camel case, title case.
Option 3:
Additionally, we can use the make.names() function in base R which replaces ’ ’ (spaces) with . (dots).
# using make.names() function
colnames(cms_data) <- make.names(colnames(cms_data))
colnames(cms_data) Output
[1] "ID" "Facility_Name" "County" "Hospital_Type"
[5] "Star.Rating" "No.of.Surveys" "Response.Rate" "Overall.Rating"
The str() function shows the structure of the data:
str(cms_data)Output
spc_tbl_ [15 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ID : chr [1:15] "050424" "140103" "100051" "040062" ...
$ Facility_Name : chr [1:15] "SCRIPPS GREEN HOSPITAL" "ST BERNARD HOSPITAL" "SOUTH LAKE HOSPITAL" "MERCY HOSPITAL FORT SMITH" ...
$ County : chr [1:15] "SAN DIEGO" "COOK" "LAKE" "SEBASTIAN" ...
$ Hospital_Type : chr [1:15] "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" ...
$ Star Rating : num [1:15] 4 1 2 3 2 3 3 4 4 2 ...
$ No of Surveys : num [1:15] 3110 264 1382 2506 1799 ...
$ Response Rate : num [1:15] 41 6 20 35 18 24 22 34 32 21 ...
$ Overall Rating: num [1:15] 5 2 2 3 2 3 3 3 3 2 ...
- attr(*, "spec")=
.. cols(
.. ID = col_character(),
.. `Facility Name` = col_character(),
.. County = col_character(),
.. `Hospital Type` = col_character(),
.. `Star Rating` = col_double(),
.. `No of Surveys` = col_double(),
.. `Response Rate` = col_double(),
.. `Overall Rating` = col_double()
.. )
- attr(*, "problems")=<externalptr>
The summary() function generates summary statistics:
summary(cms_data)Output
ID Facility_Name County Hospital_Type
Length:15 Length:15 Length:15 Length:15
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Star Rating No of Surveys Response Rate Overall Rating
Min. :1.0 Min. : 114.0 Min. : 6 Min. :2.000
1st Qu.:2.5 1st Qu.: 326.5 1st Qu.:20 1st Qu.:2.000
Median :3.0 Median : 680.0 Median :24 Median :3.000
Mean :3.0 Mean :1063.1 Mean :25 Mean :2.733
3rd Qu.:4.0 3rd Qu.:1548.0 3rd Qu.:33 3rd Qu.:3.000
Max. :4.0 Max. :3110.0 Max. :41 Max. :5.000
A statitical overview can be obtained using the skim() function in skimr package:
library(skimr)
skim(cms_data)Output
| Name | cms_data |
| Number of rows | 15 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 6 | 6 | 0 | 15 | 0 |
| Facility_Name | 0 | 1 | 16 | 32 | 0 | 15 | 0 |
| County | 0 | 1 | 4 | 14 | 0 | 14 | 0 |
| Hospital_Type | 0 | 1 | 19 | 24 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Star Rating | 0 | 1 | 3.00 | 0.93 | 1 | 2.5 | 3 | 4 | 4 | ▁▃▁▇▇ |
| No of Surveys | 0 | 1 | 1063.13 | 909.05 | 114 | 326.5 | 680 | 1548 | 3110 | ▇▁▃▁▁ |
| Response Rate | 0 | 1 | 25.00 | 9.30 | 6 | 20.0 | 24 | 33 | 41 | ▂▇▇▆▆ |
| Overall Rating | 0 | 1 | 2.73 | 0.80 | 2 | 2.0 | 3 | 3 | 5 | ▆▇▁▁▁ |
Writing Data to a File
Writing data to a file is a fundamental operation in programming and data analysis. It involves taking data from within a program or environment and storing it in a file on a disk for later use or sharing. This section explains the basics of writing a data file using the readr package.
The write_csv() and write_tsv() functions are part of the readr package, which is designed for writing delimited files like CSV (comma-separated values) and TSV (tab-separated values). These functions are used to write data frames into CSV and TSV files, respectively.
We first provide the variable name of the data frame followed by the file name (ideally including the full folder location).
To write a CSV file:
# on Mac:
write_csv(cms_data, "~/Desktop/cms_data.csv")
# on Windows
write_csv(cms_data, "C:/Users/srajapaksa/Desktop/cms_data.csv")To write a TSV file:
# on Mac:
write_tsv(cms_data, "~/Desktop/cms_data.csv")
# on Windows
write_tsv(cms_data, "C:/Users/srajapaksa/Desktop/cms_data.csv")Step 2: Tidy Data
Tidy data is a structured and organized format for presenting data that follows a simple convention: variables are placed in columns, observations are placed in rows and values are placed in cells. This standardized arrangement makes it easy to work with and analyze data efficiently. The principles of tidy data, popularized by Hadley Wickham, are designed to promote consistency and ease of use in data analysis.
This is the second step in the tidyverse workflow.
Let’s take a look at some examples.
Data is often entered in a wide format, where each row typically represents a site, subject, or patient, and there are multiple observation variables containing the same type of data.
For instance, consider the AirPassengers dataset. It contains information on monthly airline passenger numbers from 1949 to 1960. In this dataset, each row corresponds to a single year, and the columns represent each month from January to December.
AirPassengersOutput
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
Wide format is intuitive for data entry. But it is less so for data analysis. Consider calculating the monthly mean; where would you place it? Would it be another row?
Data needs to be reshaped to conform to the tidy data structure. It involves using two primary verbs (or pairs of opposites):
- Convert columns into rows (
pivot_longer()). - Convert rows into columns (
pivot_wider()). - Convert a character column into multiple columns (
separate_wider_delim()andseparate_wider_position()). - Combine multiple character columns into a single column (
unite()).
First, load the tidyr package. Since you have already installed the tidyverse, you should be able to load it directly as follows (otherwise install it using the command install.packages("tidyverse") if necessary):
library(tidyverse)Converting data from wide to long format
First read the counts file called GSE60450_normalized_data.csv that is in a folder called data (i.e. the path to the file should be data/GSE60450_normalized_data.csv).
counts <- read_csv("data/GSE60450_normalized_data.csv")
head(counts)Output
# A tibble: 6 × 14
X gene_symbol GSM1480291 GSM1480292 GSM1480293 GSM1480294 GSM1480295
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSMUSG000… Jak3 82.2 81.2 36.1 36.6 12.4
2 ENSMUSG000… Usp36 88.2 94.6 76.1 63.7 27.1
3 ENSMUSG000… Mir135a-2 0 0 0 0 0
4 ENSMUSG000… Strn4 94.3 85.1 59.5 54.3 24.4
5 ENSMUSG000… Mir6921 0.0895 0.0840 0 0.0846 0.0388
6 ENSMUSG000… Rubcn 37.8 37.7 21.3 27.7 30.1
# ℹ 7 more variables: GSM1480296 <dbl>, GSM1480297 <dbl>, GSM1480298 <dbl>,
# GSM1480299 <dbl>, GSM1480300 <dbl>, GSM1480301 <dbl>, GSM1480302 <dbl>
To transform this table from a wide format to a long format, we use the pivot_longer() function. It’s important to note that this function does not create tidy data as it duplicates rows. However, the output in ‘long format’ from pivot_longer() is often necessary for ggplot, where each aesthetic or facet category must be a single column of values and for left_join(), which will be introduced later.
This operation will convert multiple columns with counts for each sample into a single column containing all the expression values, as illustrated in the image below.
The
pivot_longer() function takes three arguments:
- cols = : a vector indicating the names of the columns to be converted into labels in long form.
- names_to = : a name or vector of names for the new column(s) containing the labels from the specified columns.
- **values_to =* *: a name for the new column containing the values corresponding to the specified columns.
It’s important to note that when using pivot_wider(), the new column names need to be enclosed in quotes.
seqdata <- counts |>
pivot_longer(cols = starts_with("GSM"),
names_to = "Sample",
values_to = "Count")The cols = starts_with("GSM") command returns a vector of columns whose names starts with “GSM”. pivot_longer() will then transform the those columns into two new columns, denoted as “Sample” and “Count.” The parameter names_to = "Sample" indicates that the new column containing the specified columns (defined by cols) should be named “Sample,” while values_to = "Count" specifies that the new column containing the values should be named “Count.”
seqdataOutput
# A tibble: 240 × 4
X gene_symbol Sample Count
<chr> <chr> <chr> <dbl>
1 ENSMUSG00000031805 Jak3 GSM1480291 82.2
2 ENSMUSG00000031805 Jak3 GSM1480292 81.2
3 ENSMUSG00000031805 Jak3 GSM1480293 36.1
4 ENSMUSG00000031805 Jak3 GSM1480294 36.6
5 ENSMUSG00000031805 Jak3 GSM1480295 12.4
6 ENSMUSG00000031805 Jak3 GSM1480296 11.9
7 ENSMUSG00000031805 Jak3 GSM1480297 10.6
8 ENSMUSG00000031805 Jak3 GSM1480298 14.9
9 ENSMUSG00000031805 Jak3 GSM1480299 7.57
10 ENSMUSG00000031805 Jak3 GSM1480300 7.06
# ℹ 230 more rows
Alternatively, we could achieve the same outcome by specifying a column range using the following command:
seqdata <- counts |>
pivot_longer(cols = GSM1480291:GSM1480302,
names_to = "Sample",
values_to = "Count")We can also specify the columns we don’t want to reformat, and pivot_longer() will then reformat all the columns except those. To achieve this, we place a minus sign (“-”) in front of the column names that we wish to exclude. This is a commonly used approach with pivot_longer(), as it can be more convenient to exclude columns we don’t need rather than explicitly include the ones we want.
seqdata <- counts |>
pivot_longer(cols = -c(X, gene_symbol),
names_to = "Sample",
values_to = "Count")Converting data from long to wide format
First, read the annotation file called GSE60450_annotation.csv (the path to the file should be data/GSE60450_annotation.csv).
annot <- read_csv("data/GSE60450_annotation.csv")
head(annot)Output
# A tibble: 6 × 3
ENSEMBL Type Annotation
<chr> <chr> <chr>
1 ENSMUSG00000031805 SYMBOL Jak3
2 ENSMUSG00000031805 GENENAME Janus kinase 3
3 ENSMUSG00000033909 SYMBOL Usp36
4 ENSMUSG00000033909 GENENAME ubiquitin specific peptidase 36
5 ENSMUSG00000065524 SYMBOL Mir135a-2
6 ENSMUSG00000065524 GENENAME microRNA 135a-2
To transform this table so that it conforms to the tidy principles, we use the pivot_wider() function.
This operation will convert multiple rows with type and annotation into columns containing the Symbol and Gene_name, as illustrated in the image below.
The pivot_wider() function takes two arguments:
- names_from = : a name or a vector of names of column(s) containing the labels that will be transformed into the new column names.
- values_from = : a name or a vector of names of column(s) containing the values that will fill the new columns.
In our scenario, to reshape the annot data frame, we will use the column names Type and Annotation:
annot_tidy <- annot |>
pivot_wider(names_from = Type,
values_from = Annotation)The above operation changes the ‘shape’ of the dataframe from a longer format (more rows) to a wider format (more columns). While the original table consists of 40 rows, using pivot_wider() results in only 20 rows. This reduction is due to the de-duplication of rows during the creation of new columns.
annot_tidyOutput
# A tibble: 20 × 3
ENSEMBL SYMBOL GENENAME
<chr> <chr> <chr>
1 ENSMUSG00000031805 Jak3 Janus kinase 3
2 ENSMUSG00000033909 Usp36 ubiquitin specific peptidase 36
3 ENSMUSG00000065524 Mir135a-2 microRNA 135a-2
4 ENSMUSG00000030374 Strn4 striatin, calmodulin binding protein 4
5 ENSMUSG00000098547 Mir6921 microRNA 6921
6 ENSMUSG00000035629 Rubcn RUN domain and cysteine-rich domain contain…
7 ENSMUSG00000094053 Scgb2b7 secretoglobin, family 2B, member 7
8 ENSMUSG00000055491 Pprc1 peroxisome proliferative activated receptor…
9 ENSMUSG00000053080 Zfta zinc finger translocation associated
10 ENSMUSG00000039715 Dync2i2 dynein 2 intermediate chain 2
11 ENSMUSG00000033475 Tomm6 translocase of outer mitochondrial membrane…
12 ENSMUSG00000026283 Ing5 inhibitor of growth family, member 5
13 ENSMUSG00000037331 Larp1 La ribonucleoprotein 1, translational regul…
14 ENSMUSG00000074489 Bglap3 bone gamma-carboxyglutamate protein 3
15 ENSMUSG00000038246 Fam50b family with sequence similarity 50, member B
16 ENSMUSG00000066189 Cacng3 calcium channel, voltage-dependent, gamma s…
17 ENSMUSG00000005611 Irag1 inositol 1,4,5-triphosphate receptor associ…
18 ENSMUSG00000064299 4921528I07Rik RIKEN cDNA 4921528I07 gene
19 ENSMUSG00000028174 Rpe65 retinal pigment epithelium 65
20 ENSMUSG00000024902 Mrpl11 mitochondrial ribosomal protein L11
It’s important to note that since we only have two distinct labels in the Type column, we are essentially replacing the existing two columns with just two new columns. Consequently, the shape of the output doesn’t technically become wider than the input data frame. However, when there are more than two unique labels in the names_from column, the output will indeed become wider compared to the input.
Separating Columns
First, read the metadata file called GSE60450_metadata.csv (the path to the file should be data/GSE60450_metadata.csv).
metadata <- read_csv("data/GSE60450_metadata.csv")
head(metadata)Output
# A tibble: 6 × 2
gene_id characteristics
<chr> <chr>
1 GSM1480291 mammary gland;luminal cells;virgin
2 GSM1480292 mammary gland;luminal cells;virgin
3 GSM1480293 mammary gland;luminal cells;18.5 day pregnancy
4 GSM1480294 mammary gland;luminal cells;18.5 day pregnancy
5 GSM1480295 mammary gland;luminal cells;2 day lactation
6 GSM1480296 mammary gland;luminal cells;2 day lactation
To transform this table so that it conforms to the tidy principles, we use the separate_wider_position()/separate_wider_delim() function. This operation will separate characteristic column into 3 separate columns containing the tissue_type, immunophenotype and development_stage, as illustrated in the image below.
The separate_wider_delim() function takes three arguments:
- cols = : a name or a vector of names of the column(s) that requires separation into multiple columns.
- delim = : delimeter (or separator) between values. This is same as the
delim =inread_delim(). - names = : a vector containing column names for the the new columns.
To separate characteristic column in the metadata data frame into three separate columns based on the delimeter ; (semi colon), we can use the separate_wider_delim() function:
metadata_lform <- metadata |>
separate_wider_delim(cols = characteristics,
delim =";",
names = c("tissue_type", "immunophenotype", "development_stage"))
metadata_lformOutput
# A tibble: 12 × 4
gene_id tissue_type immunophenotype development_stage
<chr> <chr> <chr> <chr>
1 GSM1480291 mammary gland luminal cells virgin
2 GSM1480292 mammary gland luminal cells virgin
3 GSM1480293 mammary gland luminal cells 18.5 day pregnancy
4 GSM1480294 mammary gland luminal cells 18.5 day pregnancy
5 GSM1480295 mammary gland luminal cells 2 day lactation
6 GSM1480296 mammary gland luminal cells 2 day lactation
7 GSM1480297 mammary gland basal cells virgin
8 GSM1480298 mammary gland basal cells virgin
9 GSM1480299 mammary gland basal cells 18.5 day pregnancy
10 GSM1480300 mammary gland basal cells 18.5 day pregnancy
11 GSM1480301 mammary gland basal cells 2 day lactation
12 GSM1480302 mammary gland basal cells 2 day lactation
The separate_wider_position() function splits at fixed widths and takes two arguments:
- cols = : a name or a vector of names of the column(s) that requires separation into multiple columns.
- widths = : a named vector containing numbers where the names become the new column names and values specify the column widths.
For instance, we can divide the gene_id column into three separate columns to evaluate the functionality of this operation (this is provided purely as an example):
metadata_lform |>
separate_wider_position(cols = gene_id,
widths = c(code = 3, prefix = 4, id = 3))Output
# A tibble: 12 × 6
code prefix id tissue_type immunophenotype development_stage
<chr> <chr> <chr> <chr> <chr> <chr>
1 GSM 1480 291 mammary gland luminal cells virgin
2 GSM 1480 292 mammary gland luminal cells virgin
3 GSM 1480 293 mammary gland luminal cells 18.5 day pregnancy
4 GSM 1480 294 mammary gland luminal cells 18.5 day pregnancy
5 GSM 1480 295 mammary gland luminal cells 2 day lactation
6 GSM 1480 296 mammary gland luminal cells 2 day lactation
7 GSM 1480 297 mammary gland basal cells virgin
8 GSM 1480 298 mammary gland basal cells virgin
9 GSM 1480 299 mammary gland basal cells 18.5 day pregnancy
10 GSM 1480 300 mammary gland basal cells 18.5 day pregnancy
11 GSM 1480 301 mammary gland basal cells 2 day lactation
12 GSM 1480 302 mammary gland basal cells 2 day lactation
Uniting Columns
The unite() function is the complement of separate(). Therefore, let’s revert what we did in the previous section to combine multiple columns to a single column as illustrated in the image below.
The unite() function takes three arguments:
- col = : name of the new column that will contain the united values.
- … = : a vector containing column names to unite.
- sep = : delimeter (or separator) this is same as the
delim =inread_delim(). If we don’t specify a separator to insert between the combined values, they will be separated by _ (underscores).
To separate characteristic column in the metadata data frame into three separate columns:
metadata_lform |>
unite(col = characteristics,
tissue_type, immunophenotype, development_stage,
sep = ",")Output
# A tibble: 12 × 2
gene_id characteristics
<chr> <chr>
1 GSM1480291 mammary gland,luminal cells,virgin
2 GSM1480292 mammary gland,luminal cells,virgin
3 GSM1480293 mammary gland,luminal cells,18.5 day pregnancy
4 GSM1480294 mammary gland,luminal cells,18.5 day pregnancy
5 GSM1480295 mammary gland,luminal cells,2 day lactation
6 GSM1480296 mammary gland,luminal cells,2 day lactation
7 GSM1480297 mammary gland,basal cells,virgin
8 GSM1480298 mammary gland,basal cells,virgin
9 GSM1480299 mammary gland,basal cells,18.5 day pregnancy
10 GSM1480300 mammary gland,basal cells,18.5 day pregnancy
11 GSM1480301 mammary gland,basal cells,2 day lactation
12 GSM1480302 mammary gland,basal cells,2 day lactation
Missing Values
A value can be missing in one of two possible ways:
- Explicitly, meaning it is flagged with NA.
- Implicitly, implying that it is just not present in the data.
Let’s illustrate this idea with a very simple data frame:
covid_vac <- data.frame(
year = c(2020, 2020, 2021, 2021, 2021, 2023, 2023,
2023, 2024, 2024),
vaccine_type = c("Pfizer", "Moderna", "Pfizer", "Moderna", "Novavax",
"Pfizer", "Moderna",
"Novavax", "Moderna", NA),
count = c(0, 3, 63, 88, 51,
38, 19,
5, 9, 7)
)
covid_vacOutput
year vaccine_type count
1 2020 Pfizer 0
2 2020 Moderna 3
3 2021 Pfizer 63
4 2021 Moderna 88
5 2021 Novavax 51
6 2023 Pfizer 38
7 2023 Moderna 19
8 2023 Novavax 5
9 2024 Moderna 9
10 2024 <NA> 7
In this dataset, we identify two occurrences of missing values:
- The vaccine_type in 2024 with a count of 7 is explicitly missing, denoted by the presence of NA in the cell where its value should be.
- The counts for the Novavax vaccine in 2020 and Pfizer, Novavax vaccines in 2024, are implicitly missing, as they do not appear in the dataset at all.
is.na()
To identify missing values we can use is.na() function which returns a logical vector with TRUE in the element locations that contain missing values represented by NA.
is.na(covid_vac)Output
year vaccine_type count
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
[5,] FALSE FALSE FALSE
[6,] FALSE FALSE FALSE
[7,] FALSE FALSE FALSE
[8,] FALSE FALSE FALSE
[9,] FALSE FALSE FALSE
[10,] FALSE TRUE FALSE
is.na(covid_vac$vaccine_type)Output
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
To identify the location or the number of NAs we can use the which() and sum() functions:
which(is.na(covid_vac))
sum(is.na(covid_vac))Output
[1] 20
[1] 1
na.omit()
To omit all rows containing missing values, we can use na.omit() function in base R:
na.omit(covid_vac)Output
year vaccine_type count
1 2020 Pfizer 0
2 2020 Moderna 3
3 2021 Pfizer 63
4 2021 Moderna 88
5 2021 Novavax 51
6 2023 Pfizer 38
7 2023 Moderna 19
8 2023 Novavax 5
9 2024 Moderna 9
complete()
We can use the complete() function to make our dataset more complete or to make missing values explicit in tidy data:
covid_vac |> complete(year, vaccine_type)Output
# A tibble: 16 × 3
year vaccine_type count
<dbl> <chr> <dbl>
1 2020 Moderna 3
2 2020 Novavax NA
3 2020 Pfizer 0
4 2020 <NA> NA
5 2021 Moderna 88
6 2021 Novavax 51
7 2021 Pfizer 63
8 2021 <NA> NA
9 2023 Moderna 19
10 2023 Novavax 5
11 2023 Pfizer 38
12 2023 <NA> NA
13 2024 Moderna 9
14 2024 Novavax NA
15 2024 Pfizer NA
16 2024 <NA> 7
This function add missing values for potential combinations of year and vaccine_type. One problem is that R assumes NA in status as one of the combinations. To fix this, we can specify the labels of status to be considered as follows:
covid_vac |> complete(year, vaccine_type = c("Pfizer", "Moderna", "Novavax"))Output
# A tibble: 13 × 3
year vaccine_type count
<dbl> <chr> <dbl>
1 2020 Moderna 3
2 2020 Novavax NA
3 2020 Pfizer 0
4 2021 Moderna 88
5 2021 Novavax 51
6 2021 Pfizer 63
7 2023 Moderna 19
8 2023 Novavax 5
9 2023 Pfizer 38
10 2024 Moderna 9
11 2024 Novavax NA
12 2024 Pfizer NA
13 2024 <NA> 7
We can use the fill argument to assign the fill value:
covid_vac |> complete(year,
vaccine_type = c("Pfizer", "Moderna", "Novavax"),
fill = list(count = 0))Output
# A tibble: 13 × 3
year vaccine_type count
<dbl> <chr> <dbl>
1 2020 Moderna 3
2 2020 Novavax 0
3 2020 Pfizer 0
4 2021 Moderna 88
5 2021 Novavax 51
6 2021 Pfizer 63
7 2023 Moderna 19
8 2023 Novavax 5
9 2023 Pfizer 38
10 2024 Moderna 9
11 2024 Novavax 0
12 2024 Pfizer 0
13 2024 <NA> 7
We can use the full_seq() function from tidyr to fill out the data frame with all years from 2020 to 2024 and assign vaccination types and count values of 0 to those years and for which there was no observation.
covid_vac |> complete(year = full_seq(year, period = 1),
vaccine_type = c("Pfizer", "Moderna", "Novavax"),
fill = list(count = 0))Output
# A tibble: 16 × 3
year vaccine_type count
<dbl> <chr> <dbl>
1 2020 Moderna 3
2 2020 Novavax 0
3 2020 Pfizer 0
4 2021 Moderna 88
5 2021 Novavax 51
6 2021 Pfizer 63
7 2022 Moderna 0
8 2022 Novavax 0
9 2022 Pfizer 0
10 2023 Moderna 19
11 2023 Novavax 5
12 2023 Pfizer 38
13 2024 Moderna 9
14 2024 Novavax 0
15 2024 Pfizer 0
16 2024 <NA> 7
fill()
The fill() function is used to fill missing values in a data frame, particularly within columns.
Let’s first make missing values in the covid_vac dataset explicit and assign it to a data frame named covid_vac_comp.
covid_vac_comp <- covid_vac |>
complete(year = full_seq(year, period = 1),
vaccine_type = c("Pfizer", "Moderna", "Novavax"))We can specify the direction to fill the missing values using the argument .direction. Remember to specify the list of columns to fill.
covid_vac_comp |> fill(count, .direction = "down")Output
# A tibble: 16 × 3
year vaccine_type count
<dbl> <chr> <dbl>
1 2020 Moderna 3
2 2020 Novavax 3
3 2020 Pfizer 0
4 2021 Moderna 88
5 2021 Novavax 51
6 2021 Pfizer 63
7 2022 Moderna 63
8 2022 Novavax 63
9 2022 Pfizer 63
10 2023 Moderna 19
11 2023 Novavax 5
12 2023 Pfizer 38
13 2024 Moderna 9
14 2024 Novavax 9
15 2024 Pfizer 9
16 2024 <NA> 7
Similarly, we can fill upwards as follows:
covid_vac_comp |> fill(count, .direction = "up")Output
# A tibble: 16 × 3
year vaccine_type count
<dbl> <chr> <dbl>
1 2020 Moderna 3
2 2020 Novavax 0
3 2020 Pfizer 0
4 2021 Moderna 88
5 2021 Novavax 51
6 2021 Pfizer 63
7 2022 Moderna 19
8 2022 Novavax 19
9 2022 Pfizer 19
10 2023 Moderna 19
11 2023 Novavax 5
12 2023 Pfizer 38
13 2024 Moderna 9
14 2024 Novavax 7
15 2024 Pfizer 7
16 2024 <NA> 7
Once the data is structured and organized according to tidy principles, we can begin manipulating and transforming it. The next section illustrates how this can be accomplished using the dplyr package from the tidyverse package suit.